The article proposes an expert system for detection, and subsequent investigation, of groups of collaborating automobile insurance 
fraudsters. The system is described and examined in great detail, and several technical difficulties in detecting fraud are also 
considered so that the system is applicable in practice. In contrast to many other approaches, the system uses networks for 
representation of data. Networks are the most natural representation of such a relational domain, allowing formulation and analysis 
of complex relations between entities. Fraudulent entities are found by employing a novel assessment algorithm, the Iterative 
Assessment Algorithm (IAA), also presented in the article. Besides intrinsic attributes of entities, the algorithm also explores the 
relations between entities. The prototype was evaluated and rigorously analyzed on real world data. Results show that automobile 
insurance fraud can be efficiently detected with the proposed system and that appropriate data representation is vital. 

Key words: Fraud detection. Automobile insurance. Social network analysis. Link analysis. Assessment propagation 
1. Introduction 

Fraud is encountered in a variety of domains. It comes in all 
different shapes and sizes, from traditional fraud, e.g. (simple) 
tax cheating, to more sophisticated fraud, where entire groups of 
individuals collaborate in order to commit fraud. Such groups 
can be found in the automobile insurance domain. 

Here fraudsters stage traffic accidents and issue fake insurance 
claims to gain (unjustified) funds from their general or 
vehicle insurance. There are also cases where an accident has 
never occurred, and the vehicles have only been placed onto the 
road. Still, the majority of such fraud is not planned (opportunistic 
fraud): an individual only seizes the opportunity arising 
from the accident and issues exaggerated insurance claims 
or claims for past damages. 

Staged accidents have several common characteristics. They 
occur in late hours and non-urban areas in order to reduce the 
probability of witnesses. Drivers are usually younger males, 
there are many passengers in the vehicles, but never children 
or elders. The police are always called to the scene to make the 
subsequent acquisition of means easier. It is also not uncommon 
that all of the participants have multiple (serious) injuries, 
while there is almost no damage on the vehicles. Many other 
suspicious characteristics exist, not mentioned here. 

The insurance companies are most interested in organized 
groups of fraudsters consisting of drivers, chiropractors, garage 
mechanics, lawyers, police officers, insurance workers and others. 
Such groups represent the majority of revenue leakage. 



Most of the analyses agree that approximately 20% of all insurance 
claims are in some way fraudulent (various sources). 
But most of these claims go unnoticed, as fraud investigation 
is usually done by hand by the domain expert or investigator 
and is only rarely computer supported. Inappropriate represen- 
tation of data is also common, making the detection of groups 
of fraudsters extremely difficult. An expert system approach is 
thus needed. 

Jensen (1997) has observed several technical difficulties in 
detecting fraud (various domains). Most hold for (automobile) 
insurance fraud as well. Firstly, only a small portion of accidents 
or participants is fraudulent (skewed class distribution), 
making them extremely difficult to detect. Next, there is a severe 
lack of labeled data sets as labeling is expensive and time 
consuming. Besides, due to sensitivity of the domain, there is 
even a lack of unlabeled data sets. Any approach for detecting 
such fraud should thus be founded on moderate resources (data 
sets) in order to be applicable in practice. Fraudsters are very in- 
novative and new types of fraud emerge constantly. Hence, the 
approach must also be highly adaptable, detecting new types 
of fraud as soon as they are noticed. Lastly, it holds that fully 
autonomous detection of automobile insurance fraud is not pos- 
sible in practice. Final assessment of potential fraud can only 
be made by the domain expert or investigator, who also deter- 
mines further actions in resolving it. The approach should also 
support this investigation process. 

Due to everything mentioned above, the set of approaches 
for detecting such fraud is extremely limited. We propose a 
novel expert system approach for detection and subsequent in- 
vestigation of automobile insurance fraud. The system is fo- 
cused on detection of groups of collaborating fraudsters, and 
their connecting accidents (non-opportunistic fraud), and not 
some isolated fraudulent entities. The latter should be done 
independently for each particular entity, while in our system the 
entities are assessed in a way that also considers the relations 
between them. This is done with an appropriate representation of 
the domain: networks. 

Networks are the most natural representation of any relational 
domain, allowing formulation of complex relations between 
entities. They also present the main advantage of our system 
over other approaches that use a standard flat data form. 
As collaborating fraudsters are usually related to each other in 
various ways, detection of groups of fraudsters is only possible 
with appropriate representation of data. Networks also provide 
clear visualization of the assessment, crucial for the subsequent 
investigation process. 

The system assesses the entities using a novel Iterative Assessment 
Algorithm (IAA algorithm), presented in this article. 
No learning from an initial labeled data set is done; the system 
rather allows simple incorporation of the domain knowledge. 
This makes it applicable in practice and allows detection of new 
types of fraud as soon as they are encountered. The system can 
be used with poor data sets, which is often the case in practice. 
To simulate realistic conditions, the discussion in the article and 
evaluation with the prototype system rely only on the data and 
entities found in the police record of the accident (main entities 
are participant, vehicle, collision¹, police officer). 

The article provides an in-depth description, evaluation and 
analysis of the proposed system. We pursue the hypothesis that 
automobile insurance fraud can be detected with such a system 
and that proper data representation is vital. Main contributions 
of our work are: (1) a novel expert system approach for the 
detection of automobile insurance fraud with networks; (2) a 
benchmarking study, as no expert system approach for detection 
of groups of automobile insurance fraudsters has yet been 
reported (to our knowledge); (3) an algorithm for assessment of 
entities in a relational domain, demanding no labeled data set 
(IAA algorithm); and (4) a framework for detection of groups 
of fraudsters with networks (applicable in other relational 
domains). 

The rest of the article is organized as follows. In section 2 
we discuss related work and emphasize weaknesses of other 
proposed approaches. Section 3 presents formal grounds of (social) 
networks. Next, in section 4 we introduce the proposed 
expert system for detecting automobile insurance fraud. The 
prototype system was evaluated and rigorously analyzed on real 
world data; description of the data set and obtained results are 
given in section 5. Discussion of the results is conducted in 
section 6, followed by the conclusion in section 7. 

2. Related work 

Our work belongs to the wide field of fraud detection. Fraud 
appears in many domains including telecommunications, 
banking, medicine, e-commerce, general and automobile 
insurance. Thus a number of expert system approaches 
for preventing, detecting and investigating fraud have been 
developed in the past. Researchers have proposed using some 
standard methods of data mining and machine learning, neural 
networks, fuzzy logic, genetic algorithms, support vector 
machines, (logistic) regression, consolidated (classification) 
trees, approaches over red-flags or profiles, various statistical 
methods and other methods and approaches (Artis et al., 2002; 
Rupnik et al., 2007; Quah & Sriganesh, 2008; Sanchez et al., 
2009; Viaene et al., 2002, 2005; Weisberg & Derrig, 1998; 
Yang & Hwang, 2006). Analyses show that in practice none 
is significantly better than the others (Bolton & Hand, 2002; 
Viaene et al., 2005). Furthermore, they mainly have three 
weaknesses. They (1) use inappropriate or inexpressive 
representation of data; (2) demand a labeled (initial) data set; 
and (3) are only suitable for larger, richer data sets. It turns 
out that these are generally a problem when dealing with fraud 
detection (Jensen, 1997; Phua et al., 2005). 

In the narrower sense, our work comes near the approaches 
from the field of network analysis that combine 
intrinsic attributes of entities with their relational attributes. 
Noble & Cook (2003) proposed detecting anomalies in networks 
with various types of vertices, but they focus on detecting 
suspicious structures in the network, not vertices (i.e. 
entities). Besides that, the approach is more appropriate for 
larger networks. Researchers also proposed detecting anomalies 
using measures of centrality (Freeman, 1977, 1979), random 
walks (Sun et al., 2005) and other (Holder & Cook, 2003; 
Maxion & Tan, 2000), but these approaches mainly rely only 
on the relational attributes of entities. 

Many researchers have investigated the problem of classification 
in the relational context, following the hypothesis 
that classification of an entity can be improved by 
also considering its related entities (inference). Thus many 
approaches formulating inference, spread or propagation 
on networks have been developed in various fields of research 
(Brin & Page, 1998; Domingos & Richardson, 2001; 
Minka, 2001; Neville & Jensen, 2000). Most 
of them are based on one of the three most popular 
(approximate) inference algorithms: Relaxation Labeling 
(RL) (Hummel & Zucker, 1983) from the computer vision community, 
Loopy Belief Propagation (LBP) on loopy (Bayesian) 
graphical models (Kschischang & Frey, 1998) and Iterative 
Classification Algorithm (ICA) from the data mining community 
(Neville & Jensen, 2000). For the analyses and comparison 
see (Kempe et al., 2003; Sen & Getoor, 2007). 

¹ Throughout the article the term collision is used instead of (traffic) accident. 
The word accident implies there is no one to blame, which contradicts 
the article. 

Researchers have reported good results with these algorithms 
(Brin & Page, 1998; Kschischang & Frey, 1998; Lu & Getoor, 
2003b; Neville & Jensen, 2000), however they mainly address 
the problem of learning from an (initial) labeled data set (supervised 
learning) or a partially labeled one (semi-supervised learning) 
(Lu & Getoor, 2003a), therefore the approaches are generally 
inappropriate for fraud detection. The algorithm we introduce 
here, the IAA algorithm, is almost identical to the ICA algorithm, 
however it was developed with different intentions in 
mind: to assess the entities when no labeled data set is at hand 
(and not for improving classification with inference). Furthermore, 
IAA does not address the problem of classification, but 
ranking. Thus, in this way, it is actually a simplification of the 
RL algorithm, or even Google's PageRank (Brin & Page, 1998), 
still it is not founded on the probability theory like the latter. 

We conclude that due to the weaknesses mentioned, most of 
the proposed approaches are inappropriate for detection of (automobile) 
insurance fraud. Our approach differs, as it does not 
demand a labeled data set and is also appropriate for smaller 
data sets. It represents data with networks, which are one of the 
most natural representations and allow complex analysis without 
simplification of data. It should be pointed out that networks, 
despite their strong foundations and expressive power, have not 
yet been used for detecting (automobile) insurance fraud (at 
least according to our knowledge). 




Fig. 1: (a) simple graph with directed edges; (b) undirected multigraph with la- 
beled vertices and edges (labels are represented graphically); (c) network repre- 
senting collisions where round vertices correspond to participants and cornered 
vertices correspond to vehicles. Collisions are represented with directed edges 
between vehicles. 



3. (Social) networks 

Networks are based upon mathematical objects called 
graphs. Informally speaking, a graph consists of a collection of 
points, called vertices, and links between these points, called 
edges (Fig. 1). Let $V_G$ and $E_G$ be the sets of vertices and edges 
of some graph $G$, respectively. We define $G$ as $G = (V_G, E_G)$, where 

$$V_G = \{v_1, v_2, \dots, v_n\}, \tag{1}$$
$$E_G \subseteq \{\{v_i, v_j\} \mid v_i, v_j \in V_G \wedge i < j\}. \tag{2}$$

Note that edges are sets of vertices, hence they are not directed 
(undirected graph). In the case of directed graphs, equation (2) 
is rewritten as 

$$E_G \subseteq \{(v_i, v_j) \mid v_i, v_j \in V_G \wedge i \neq j\}, \tag{3}$$

where edges are ordered pairs of vertices: $(v_i, v_j)$ is an edge 
from $v_i$ to $v_j$. The definition can be further generalized by 
allowing multiple edges between two vertices and loops (edges 
that connect vertices with themselves). Such graphs are called 
multigraphs. Examples of some simple (multi)graphs can be 
seen in Fig. 1. 
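
The definitions above can be mirrored directly in code. The following is only a minimal sketch, assuming vertices are plain strings; the variable and function names are illustrative and do not come from the article.

```python
from collections import Counter

# Vertex set V_G (equation 1); the names are illustrative.
V = {"v1", "v2", "v3"}

# Undirected edges (equation 2): each edge is a two-element set of vertices.
E_undirected = {frozenset({"v1", "v2"}), frozenset({"v2", "v3"})}

# Directed edges (equation 3): each edge is an ordered pair (source, target).
E_directed = {("v1", "v2"), ("v2", "v1"), ("v2", "v3")}

# Multigraph: a Counter keeps the multiplicity of parallel edges,
# and a pair (v, v) is a loop.
E_multi = Counter([("v1", "v2"), ("v1", "v2"), ("v3", "v3")])


def is_simple_undirected(vertices, edges):
    """Check that every edge is a set of two distinct known vertices."""
    return all(len(e) == 2 and e <= vertices for e in edges)


def is_simple_directed(vertices, edges):
    """Check that every edge is an ordered pair of two distinct known vertices."""
    return all(u in vertices and v in vertices and u != v for u, v in edges)
```

Storing undirected edges as `frozenset` objects makes $\{v_i, v_j\}$ and $\{v_j, v_i\}$ the same edge automatically, whereas tuples keep the direction.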

In practical applications we usually strive to store some extra 
information along with the vertices and edges. Formally, we 
can define two labeling functions 

$$l_{V_G}: V_G \to \Sigma_{V_G}, \tag{4}$$
$$l_{E_G}: E_G \to \Sigma_{E_G}, \tag{5}$$

where $\Sigma_{V_G}$ and $\Sigma_{E_G}$ are (finite) alphabets of all possible vertex and edge 
labels, respectively. A labeled graph can be seen in Fig. 1(b). 

We proceed by introducing some terms used later on. Let 
$G$ be some undirected multigraph or the underlying graph of 
some directed multigraph. The underlying graph consists of the same 
vertices and edges as the original directed (multi)graph, only 
that all of its edges are set to be undirected. $G$ naturally partitions 
into a set of (connected) components denoted $C(G)$. E.g. 
all three graphs in Fig. 1 have one connected component, whereas 
graphs in Fig. 2 consist of several connected components. From 
here on, we assume that $G$ consists of a single connected component. 
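
As an illustration of the partition $C(G)$, the sketch below finds connected components with a breadth-first search. It assumes edges are given as pairs of hashable vertex names and ignores their direction (i.e. it works on the underlying graph); the function name is hypothetical.

```python
from collections import deque


def connected_components(vertices, edges):
    """Return the connected components of the underlying undirected graph.

    `edges` is an iterable of pairs (u, v); direction is ignored.
    """
    neighbours = {v: set() for v in vertices}
    for u, v in edges:
        neighbours[u].add(v)
        neighbours[v].add(u)

    seen, components = set(), []
    for start in vertices:
        if start in seen:
            continue
        # Breadth-first search collects everything reachable from `start`.
        component, queue = {start}, deque([start])
        seen.add(start)
        while queue:
            current = queue.popleft()
            for nxt in neighbours[current]:
                if nxt not in seen:
                    seen.add(nxt)
                    component.add(nxt)
                    queue.append(nxt)
        components.append(component)
    return components
```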

Let $v_i$ be some vertex in graph $G$, $v_i \in V_G$. The degree of the 
vertex $v_i$, denoted $d(v_i)$, is the number of edges incident to it. 
Formally, 

$$d(v_i) = |\{e \mid e \in E_G \wedge v_i \in e\}|. \tag{6}$$

Let $v_j$ be some other vertex in graph $G$, $v_j \in V_G$, and let $p(v_i, v_j)$ 
be a path between $v_i$ and $v_j$. A path is a sequence of vertices 
on a way that leads from one vertex to another (including $v_i$ 
and $v_j$). There can be many paths between two vertices. A 
geodesic $g(v_i, v_j)$ is a path of minimum size, i.e. one that consists 
of the least number of vertices. Again, there can be many 
geodesics between two vertices. 

We can now define the distance between two vertices, i.e. $v_i$ 
and $v_j$, as 

$$d(v_i, v_j) = |g(v_i, v_j)| - 1. \tag{7}$$

The distance between $v_i$ and $v_j$ is the number of edges visited when 
going from $v_i$ to $v_j$ (or vice versa). The diameter of some graph 
$G$, denoted $d(G)$, is a measure for the "width" of the graph. 
Formally, it is defined as the maximum distance between any 
two vertices in the graph, 

$$d(G) = \max\{d(v_i, v_j) \mid v_i, v_j \in V_G\}. \tag{8}$$
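
Equations (6) to (8) can be checked on small examples with a short sketch. It assumes a simple, connected, undirected graph given as a list of vertex pairs; the helper names are illustrative only.

```python
from collections import deque


def adjacency(edges):
    """Build an undirected adjacency mapping from a list of vertex pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def degree(edges, v):
    """Equation (6): number of edges incident to v."""
    return sum(1 for e in edges if v in e)


def distances_from(adj, source):
    """Breadth-first search: number of edges on a geodesic from source."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def diameter(adj):
    """Equation (8): maximum distance over all vertex pairs (connected graph)."""
    return max(max(distances_from(adj, v).values()) for v in adj)
```

For the path a-b-c-d, for instance, a geodesic from a to d visits four vertices, so by equation (7) the distance is 3, which is also the diameter.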

All graphs can be divided into two classes. The first are cyclic 
graphs, having a path $p(v_i, v_i)$ that contains at least two other 
vertices (besides $v_i$) and has no repeated vertices. Such a path 
is called a cycle. Graphs in Fig. 1(a) and (b) are both cyclic. 
The second class consists of acyclic graphs, more commonly 
known as trees. These are graphs that contain no cycle 
(see Fig. 1(c)). Note that a simple (connected) undirected graph is a tree if 
and only if $|E_G| = |V_G| - 1$. 
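
The edge-count criterion only characterizes trees when the graph is connected, which the text assumes from here on. A small sketch that checks both conditions explicitly, assuming a simple undirected graph without loops (the function name is hypothetical):

```python
def is_tree(vertices, edges):
    """A connected simple undirected graph is a tree iff |E| = |V| - 1.

    Assumes no loops; parallel edges collapse because edges become sets.
    """
    vertices = set(vertices)
    edges = {frozenset(e) for e in edges}
    if not vertices or len(edges) != len(vertices) - 1:
        return False

    adj = {v: set() for v in vertices}
    for e in edges:
        u, v = tuple(e)
        adj[u].add(v)
        adj[v].add(u)

    # Depth-first traversal to confirm that the graph is connected.
    start = next(iter(vertices))
    seen, stack = {start}, [start]
    while stack:
        for w in adj[stack.pop()]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return seen == vertices
```

The last test case below (a triangle plus an isolated vertex) has $|E_G| = |V_G| - 1$ but is not a tree, which is why the connectivity check is needed.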

Finally, we introduce the vertex cover of a graph $G$. Let $S$ be 
a subset of vertices, $S \subseteq V_G$, with the property that each edge in 
$E_G$ has at least one of its incident vertices in $S$ (covered by $S$). 
Such $S$ is called a vertex cover. It can be shown that finding a 
minimum vertex cover is NP-hard in general. 
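
For the small components considered later, a minimum vertex cover can still be found by exhaustive search. The sketch below is one such brute-force approach (exponential in the number of vertices, so suitable only for small graphs); it is an illustration, not the method used by the authors.

```python
from itertools import combinations


def minimum_vertex_cover(vertices, edges):
    """Smallest vertex subset touching every edge (exhaustive search)."""
    vertices = sorted(vertices)
    # Try subsets in order of increasing size; the first cover found is minimum.
    for size in range(len(vertices) + 1):
        for candidate in combinations(vertices, size):
            chosen = set(candidate)
            if all(u in chosen or v in chosen for u, v in edges):
                return chosen
    return set(vertices)
```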

Graphs have been studied and investigated for almost 300 
years, thus a strong theory has been developed. There 
are also numerous practical problems and applications where 
graphs have shown their usefulness (e.g. Brin & Page, 1998): 
they are the most natural representation of many domains and 
are indispensable whenever we are interested in relations between 
entities or in patterns in these relations. We emphasize 
this only to show that networks have a strong mathematical, and 
also practical, foundation. Networks² are usually seen as labeled, 
or weighted, multigraphs with both directed and undirected 
edges (see Fig. 1(c)). Furthermore, vertices of a network 
usually represent some entities, and edges represent some 
relations between them. When vertices correspond to people, 
or groups of people, such networks are called social networks. 

Networks often consist of densely connected subsets of vertices 
called communities. Formally, communities are subsets 
of vertices with many edges between the vertices within some 
community and only a few edges between the vertices of different 
communities. Girvan & Newman (2002) suggested identifying 
communities by recursively removing the edges between 
them (between edges). As many geodesics run 
along such edges, whereas only few geodesics run along edges 
within communities, between edges can be removed by using 
edge betweenness (Girvan & Newman, 2002). It is defined as 

$$\mathrm{Bet}(e_i) = |\{g(v_i, v_j) \mid v_i, v_j \in V_G \wedge g(v_i, v_j) \text{ goes along } e_i\}|, \tag{9}$$

where $e_i \in E_G$. The edge betweenness $\mathrm{Bet}(e_i)$ is thus the number 
of all geodesics that run along edge $e_i$. 
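
Equation (9) can be evaluated without enumerating geodesics explicitly, by counting shortest paths with breadth-first search. The sketch below assumes a simple undirected graph given as an adjacency mapping with sortable vertex names, and counts each unordered vertex pair once; both choices are assumptions made for illustration. Note that it counts whole geodesics, as in equation (9), rather than the fractional variant often used elsewhere.

```python
from collections import deque
from itertools import combinations


def bfs_counts(adj, source):
    """Return (dist, sigma): distances and numbers of geodesics from source."""
    dist, sigma = {source: 0}, {source: 1}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w], sigma[w] = dist[u] + 1, 0
                queue.append(w)
            if dist[w] == dist[u] + 1:
                sigma[w] += sigma[u]
    return dist, sigma


def edge_betweenness(adj):
    """Number of geodesics running along each edge, per equation (9)."""
    info = {v: bfs_counts(adj, v) for v in adj}
    bet = {frozenset((u, w)): 0 for u in adj for w in adj[u]}
    for s, t in combinations(sorted(adj), 2):
        (dist_s, sigma_s), (dist_t, sigma_t) = info[s], info[t]
        if t not in dist_s:
            continue  # s and t lie in different components
        for e in bet:
            u, w = tuple(e)
            # A geodesic s..a-b..t uses edge {a, b} iff the lengths add up.
            for a, b in ((u, w), (w, u)):
                if a in dist_s and b in dist_t and dist_s[a] + 1 + dist_t[b] == dist_s[t]:
                    bet[e] += sigma_s[a] * sigma_t[b]
    return bet
```

For two triangles joined by a single bridge, every geodesic between the triangles uses the bridge, so the bridge receives by far the highest value.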

For more details on (social) networks see e.g. (Newman, 
2003, 2008). 



4. Expert system for detecting automobile insurance fraud 

As mentioned above, the proposed expert system uses (primarily 
constructed) networks of collisions to assign a suspicion 
score to each entity. These scores are used for the detection 
of groups of fraudsters and their corresponding collisions. 
The framework of the system is structured into four modules 
(Fig. 2). 

In the first module, different types of networks are con- 
structed from the given data set. When necessary, the networks 
are also simplified - divided into natural communities that ap- 
pear inside them. The latter is done without any loss of gener- 
ality. 

Networks from the first module naturally partition into several 
connected components. In the second module we investigate 
these components and output the suspicious ones, focusing 
mainly on their structural properties such as diameter, cycles, 
etc. Other components are discarded at the end of this module. 

² Throughout the article the terms graph and network are used as synonyms. 



Fig. 2: Framework of the proposed expert system for detecting (automobile 
insurance) fraud. 



Not all entities in some suspicious component are necessarily 
suspicious. In the third module components are thus further 
analyzed in order to detect key entities inside them. They are 
found by employing the Iterative Assessment Algorithm (IAA), 
presented in this article. The algorithm assigns a suspicion score 
to each entity, which can be used for subsequent assessment 
and analysis, i.e. to identify suspicious groups of entities and their 
connecting collisions. In general, suspicious groups are subsets 
of suspicious components. 

Note that detection of suspicious entities is done in two 
stages (second and third module). In the first stage, or the second 
module, we focus only on detecting suspicious components, 
and in the second stage, or the third module, we also locate the 
suspicious entities within them. Hence the detection in the first 
and second stage is done at the level of components and entities, 
respectively. The reason for this hierarchical investigation is that early 
stages simplify assessment in the later stages, possibly without 
any loss for detection (for further implications see section 6). 

It holds that fully autonomous detection of automobile in- 
surance fraud is not possible in practice. The obtained results 
should always be investigated by the domain expert or inves- 
tigator, who determines further actions for resolving potential 
fraud. The purpose of the last, fourth, module of the system is 
thus to appropriately assess and visualize the obtained results, 
allowing the domain expert or investigator to conduct subse- 
quent analysis. 

The first three modules of the system are presented in sections 
4.1, 4.2 and 4.3, respectively, while the last module is only 
briefly discussed in section 4.4. 

4.1. Representation with networks 

Every entity's attribute is either intrinsic or relational. Intrinsic 
attributes are those that are independent of the entity's 
surroundings (e.g. person's age), while the relational attributes 
represent, or are dependent on, relations between entities (e.g. 
relation between two colliding drivers). Relational attributes can 
be naturally represented with the edges of a network. Thus we 
get networks, where vertices correspond to entities and edges 
correspond to relations between them. Numerous different net- 
works can be constructed, depending on which entities we use 
and how we connect them to each other. 

The purpose of this first module of the system is to construct 
different types of networks, used later on. It is not immediately 
clear how to construct networks that describe the domain in the 
best possible way and are most appropriate for our intentions. 
This problem arises as networks, despite their high expressive 
power, can only represent relations between two entities 
(i.e. binary relations). As collisions are actually relations 
between multiple entities, some sort of projection of the data 
set must be made (for other suggestions see section 7). 

Collisions can thus be represented with various types of net- 
works, not all equally suitable for fraud detection. In our opin- 
ion, there are some guidelines that should be considered when 
constructing networks from any relational domain data (guide- 
lines are given approximately in the order of their importance): 

1. Intention: networks should be constructed so that they are 
most appropriate for our intentions (e.g. fraud detection) 

2. Domain: networks should be constructed in a way that de- 
scribes the domain as it is (e.g. connected vertices should 
represent some entities, also directly connected in the data 
set) 

3. Expressiveness: expressive power of the constructed net- 
works should be as high as possible 

4. Structure: structure of the networks should not be used for 
describing some specific domain characteristics (e.g. there 
should be no cycles in the networks when there are no actual 
cycles in the data set). Structural properties of networks are 
a strong tool that can be used in the subsequent (investiga- 
tion) process, but only when these properties were not artifi- 
cially incorporated into the network during the construction 
process 

5. Simplicity: networks should be kept as simple and sparse as 
possible (e.g. not all entities need to be represented by their 
own vertices). The hypothesis here is that simple networks 
would also allow simpler subsequent analysis and clearer final 
visualization (principle of Occam's razor³) 

6. Uniqueness: every network should uniquely describe the 
data set being represented (i.e. there should be a bijection 
between different data sets and corresponding networks) 

Frequently not all guidelines can be met and some trade-offs 
have to be made. 



In general there are $\binom{3}{1} + \binom{3}{2} + \left(\binom{3}{2} + \binom{3}{3}\right) = 10$ possible ways 
to connect three entities (i.e. collision, participant and vehicle), 
depending on which entities we represent with their own 
vertices. 7 of these represent participants with vertices and in 4 
cases all entities are represented by their own vertices. For the 
reason of simplicity, we focus on the remaining 3 cases. In the 
following we introduce four different types of such networks, 
as an example and for later use. All can be seen in Fig. 3. 
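
One way to read this count, consistent with the figures quoted above (10, 7, 4 and 3), is to choose which entity types get their own vertices and then which pairs of those types are linked directly, keeping only choices that leave all chosen types connected. This interpretation is an assumption made for illustration; the sketch below simply enumerates it.

```python
from itertools import combinations

ENTITIES = ("collision", "participant", "vehicle")


def connects_all(types, links):
    """True if the chosen links join all chosen entity types into one piece."""
    types = list(types)
    reached, frontier = {types[0]}, [types[0]]
    while frontier:
        current = frontier.pop()
        for a, b in links:
            for x, y in ((a, b), (b, a)):
                if x == current and y not in reached:
                    reached.add(y)
                    frontier.append(y)
    return reached == set(types)


def network_designs():
    """Enumerate (vertex types, links between types) under the assumed reading."""
    designs = []
    for k in range(1, len(ENTITIES) + 1):
        for types in combinations(ENTITIES, k):
            possible_links = list(combinations(types, 2))
            for m in range(len(possible_links) + 1):
                for links in combinations(possible_links, m):
                    if connects_all(types, links):
                        designs.append((types, links))
    return designs
```

Under this reading, the designs that include participants but not all three entity types are the three simpler cases the text focuses on.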
³ The principle states that the explanation of any phenomenon should make 
as few assumptions as possible, eliminating those making no difference in the 
assessment - entities should not be multiplied beyond necessity. 



Fig. 3: Four types of networks representing same two collisions - (a) drivers 
network, (b) participants network, (c) COPTA network and (d) vehicles net- 
work. Rounded vertices correspond to participants, hexagons correspond to 
collisions and irregular cornered vertices correspond to vehicles. Solid directed 
edges represent involvement in some collision, solid undirected edges represent 
drivers (only for the vehicles network) and dashed edges represent passengers. 
Guilt in the collision is formulated with edge's direction. 



The simplest way is to only connect the drivers who were 
involved in the same collision (drivers networks). Guilt in the 
collision is formulated with the edge's direction. Note that drivers 
networks severely lack expressive power (guideline 3). We 
can therefore add the passengers and get participants networks, 
where passengers are connected with the corresponding drivers. 
Such networks are already much richer, but they have one major 
weakness: passengers "group" on the driver, i.e. it is generally 
not clear which passengers were involved in the same collision 
and not even how many passengers were involved in some 
particular collision (guidelines 3, 6). 

This weakness is partially eliminated by COnnect Passengers 
Through Accidents networks (COPTA networks). We add special 
vertices representing collisions and all participants in some 
collision are now connected through these vertices. Passengers 
no longer group on the drivers but on the collisions, thus the 
problem is partially eliminated. We also add special edges between 
the drivers and the collisions, to indicate the number of 
passengers in the vehicle. This type of network could be adequate 
for many practical applications, but it should be mentioned 
that the distance between two colliding drivers is now 
twice as large as before, although the drivers are those that were directly 
related in the collision (guideline 2). 
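
The three constructions described so far can be sketched on a toy data set. The record layout, the field names, the direction of the guilt edge (from the guilty driver to the other one) and the way the passenger count is stored are all assumptions made for illustration, not the authors' schema.

```python
from itertools import combinations

# Hypothetical record layout; field names are assumptions.
collisions = [
    {"id": "c1", "vehicles": [
        {"driver": "A", "passengers": ["P1", "P2"], "guilty": True},
        {"driver": "B", "passengers": [], "guilty": False},
    ]},
    {"id": "c2", "vehicles": [
        {"driver": "B", "passengers": ["P1"], "guilty": True},
        {"driver": "C", "passengers": [], "guilty": False},
    ]},
]


def drivers_network(collisions):
    """Directed edges between colliding drivers; here guilty -> other driver."""
    edges = []
    for c in collisions:
        for x, y in combinations(c["vehicles"], 2):
            if y["guilty"] and not x["guilty"]:
                x, y = y, x
            edges.append((x["driver"], y["driver"], c["id"]))
    return edges


def participants_network(collisions):
    """Drivers network plus an edge from each passenger to the driver."""
    edges = drivers_network(collisions)
    for c in collisions:
        for v in c["vehicles"]:
            edges += [(p, v["driver"], c["id"]) for p in v["passengers"]]
    return edges


def copta_network(collisions):
    """Link every participant to a vertex that stands for the collision."""
    edges, passenger_counts = [], {}
    for c in collisions:
        for v in c["vehicles"]:
            edges.append((v["driver"], c["id"], "driver"))
            edges += [(p, c["id"], "passenger") for p in v["passengers"]]
            # Assumed encoding of the driver-collision passenger information.
            passenger_counts[(v["driver"], c["id"])] = len(v["passengers"])
    return edges, passenger_counts
```

In the participants network passenger P1 is attached to drivers A and B, so it is no longer visible which collision each link belongs to without the extra label; in the COPTA network P1 is attached to the collision vertices c1 and c2 instead.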

The last type of networks are vehicles networks, where special 
vertices are added to represent vehicles. Collisions are now represented 
by edges between vehicles, and driver and passengers 
are connected through them. Such networks provide good visualization 
of the collisions and also incorporate another entity, 
but they have many weaknesses as well. Two colliding drivers 
are very far apart and (included) vehicles are not actually of 
interest to us (guideline 5). Such networks also seem to suggest that 
the vehicles are the ones responsible for the collision (guideline 2). 
Vehicles networks are also much larger than the previous ones. 

A better way to incorporate vehicles into networks is simply 
to connect collisions in which the same vehicle was involved. 
Similar holds for other entities like police officers, chiropractors, 
lawyers, etc. Using special vertices for these entities 
would only unnecessarily enlarge the networks and consequently 
make subsequent detection harder (guidelines 1, 5). It 
is also true that these entities usually aren't available in practice 
(sensitivity of the domain). 

A summary of the analysis of different types of networks is 
given in Table 1. 

Guidelines and networks    drivers    particip.    COPTA    vehicles
Intention    +    ++ +    ++    5
Domain    4
Expressive., Structure    + 4    4
Simplicity, Uniqueness    -H    3 2
Total    -5    4    -1    -9 1    -8 -8



Table 1: Comparison of different types of networks with respect to the proposed 
guidelines. Scores assigned to the guidelines are a choice made by the authors. 
Analysis for Intention (guideline 1), and total score, is given separately for the 
second and third module, respectively. 



There is of course no need to use the same type of networks 
in every stage of the detection process (guideline 1). In the 
prototype system we thus use participants networks in the second 
module (section 4.2), as they provide enough information 
for initial suspicious components detection, and COPTA networks 
in the third module (section 4.3), whose adequacy will 
be clearer later. Other types of networks are used only for 
visualization purposes. Network scores, given in Table 1, confirm 
this choice. 

After the construction of networks is done, the resulting connected 
components can be quite large (depending on the type of 
networks used). As it is expected that groups of fraudsters are 
relatively small, the components should in this case be simplified. 
We suggest using edge betweenness (Girvan & Newman, 
2002) to detect communities in the network (i.e. supersets of 
groups of fraudsters) by recursively removing the edges until 
the resulting components are small enough. As using edge betweenness 
assures that we would be removing only the edges 
between the communities, and not the edges within communities, 
simplification is done without any loss of generality. 
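
The simplification step can be sketched as a loop that repeatedly removes the edge with the highest edge betweenness from any component that is still too large. The size limit `max_size`, the tie-breaking rule and the function names are assumptions for illustration; the graph is taken to be simple and undirected, given as a symmetric adjacency mapping with sortable vertex names.

```python
from collections import deque
from itertools import combinations


def bfs_counts(adj, source):
    """Distances and numbers of geodesics from source (breadth-first search)."""
    dist, sigma = {source: 0}, {source: 1}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w], sigma[w] = dist[u] + 1, 0
                queue.append(w)
            if dist[w] == dist[u] + 1:
                sigma[w] += sigma[u]
    return dist, sigma


def edge_betweenness(adj):
    """Number of geodesics running along each edge (unordered vertex pairs)."""
    info = {v: bfs_counts(adj, v) for v in adj}
    bet = {frozenset((u, w)): 0 for u in adj for w in adj[u]}
    for s, t in combinations(sorted(adj), 2):
        (dist_s, sigma_s), (dist_t, sigma_t) = info[s], info[t]
        if t not in dist_s:
            continue
        for e in bet:
            u, w = tuple(e)
            for a, b in ((u, w), (w, u)):
                if a in dist_s and b in dist_t and dist_s[a] + 1 + dist_t[b] == dist_s[t]:
                    bet[e] += sigma_s[a] * sigma_t[b]
    return bet


def components(adj):
    """Connected components as a list of vertex sets."""
    seen, result = set(), []
    for start in adj:
        if start not in seen:
            reached = set(bfs_counts(adj, start)[0])
            seen |= reached
            result.append(reached)
    return result


def simplify(adj, max_size):
    """Remove highest-betweenness edges until no component exceeds max_size (>= 1)."""
    adj = {v: set(ns) for v, ns in adj.items()}  # work on a copy
    while True:
        large = [c for c in components(adj) if len(c) > max_size]
        if not large:
            return components(adj)
        bet = edge_betweenness({v: adj[v] for v in large[0]})
        u, w = tuple(max(bet, key=lambda e: (bet[e], sorted(e))))
        adj[u].discard(w)
        adj[w].discard(u)
```

Recomputing the betweenness after every removal keeps the sketch short but is expensive; it is meant for components of modest size.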



4.2. Suspicious components detection 

The networks from the first module consist of several con- 
nected components. Each component describes a group of re- 
lated entities (i.e. participants, due to the type of networks 
used), where some of these groups contain fraudulent entities. 
Within this module of the system we want to detect such groups 
(i.e. fraudulent components) and discard all others, in order to 
simplify the subsequent detection process in the third module. 
Not all entities in some fraudulent component are necessarily 
fraudulent. The purpose of the third module is to identify only 
those that are. 

Analyses, conducted with the help of a domain expert, 
showed that fraudulent components share several structural 
characteristics. Such components are usually much larger than 
other, non-fraudulent components, and are also denser. The 
underlying collisions often happened in suspicious circumstances, 
and the ratio of the number of different drivers to the number 
of collisions is usually close to 1 (for reference, the ratio 
for completely independent collisions is 2). There are vertices 
with extremely high degree and centrality. Components 
have a small diameter, (short) cycles appear and the size of the 
minimum vertex cover is also very small (all with respect to the size of 
the component). There are also other characteristics, all implying 
that entities, represented by such components, are unusually 
closely related to each other. An example of a fraudulent component 
with many of the mentioned characteristics is shown in 
Fig. 4. 



Fig. 4: Example of a component of participants network with many of the sus- 
picious characteristics shared by fraudulent components. 



We have thus identified several indicators of likelihood that 
some component is fraudulent (i.e. suspicious component). The 
detection of suspicious components is done by assessing these 
indicators. Only simple indicators are used (no combinations 
of indicators). 

Formally, we define an ensemble of $n$ indicators as
$I = [I_1, I_2 \dots I_n]^T$. Let $c$ be some connected component in
network $G$, $c \in C(G)$, and let $H_i(c)$ be the value for $c$ of the
characteristic measured by indicator $I_i$. Then

$$ I_i(c) = \begin{cases} 1 & c \text{ has suspicious value of } H_i \\ 0 & \text{otherwise.} \end{cases} \tag{10} $$



For the reason of simplicity, all indicators are defined as binary 
attributes. For the indicators that measure a characteristic that is 
independent of the structure of the component (e.g. number of 
vertices, collisions, etc.), simple thresholds are defined in order 
to distinguish suspicious components from others (due to this 
characteristic). These thresholds are set by the domain expert. 

Other characteristics are usually greatly dependent on the 
number of the vertices and edges in the component. A simple
threshold strategy thus does not work. Values of such $H_i$
could of course be "normalized" before the assessment (based
on the number of vertices and edges), but it is often not clear 
how. Values could also be assessed using some (supervised) 
learning algorithm over a labeled data set, but a huge set would 
be needed, as the assessment should be done for each number 
of vertices and edges separately (owing to the dependence men- 
tioned). What remains is to construct random networks of (pre- 
sumably) honest behavior and assess the values of such charac- 
teristics using them. 

No in-depth analysis of collisions networks has so far been
reported, and it is thus not clear how to construct such random
networks. General random network generators or models,
e.g. (Barabasi & Albert, 1999; Eppstein & Wang, 2002),
mainly give results far away from the collisions networks (visually
and by assessing different characteristics). Therefore a
sort of rewiring algorithm is employed, initially proposed by
Ball et al. (1997) and Watts & Strogatz (1998).

The algorithm iteratively rewires edges in some component
$c$, meaning that we randomly choose two edges in $E_c$, $\{v_i, v_j\}$
and $\{v_k, v_l\}$, and switch one of their incident vertices. The
resulting edges are e.g. $\{v_i, v_l\}$ and $\{v_k, v_j\}$ (see Fig. 5). The
number of vertices and edges does not change during the rewiring
process and the values for some $H_i$ can thus be assessed by
generating a sufficient number of such random networks (for each
component).
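
The rewiring step can be sketched in a few lines of Python. This is only a minimal illustration, not the authors' implementation (whose details are omitted in the text): the function name `rewire`, the edge-list representation and the rule of skipping swaps that would create a self-loop or a duplicate edge are all assumptions made here.

```python
import random


def rewire(edges, n_rewirings, seed=None):
    """Return a rewired copy of an undirected edge list.

    Each successful step picks two edges {a, b} and {c, d} and switches
    one incident vertex, giving {a, d} and {c, b}. Swaps that would
    create a self-loop or a duplicate edge are skipped (an assumption).
    """
    rng = random.Random(seed)
    edges = [tuple(e) for e in edges]
    present = {frozenset(e) for e in edges}
    done, attempts = 0, 0
    # Bound the number of attempts so the loop always terminates.
    while done < n_rewirings and attempts < 100 * max(n_rewirings, 1):
        attempts += 1
        i, j = rng.sample(range(len(edges)), 2)
        (a, b), (c, d) = edges[i], edges[j]
        new1, new2 = frozenset((a, d)), frozenset((c, b))
        if len(new1) < 2 or len(new2) < 2:
            continue  # would create a self-loop
        if new1 in present or new2 in present or new1 == new2:
            continue  # would create a duplicate edge
        present -= {frozenset((a, b)), frozenset((c, d))}
        present |= {new1, new2}
        edges[i], edges[j] = (a, d), (c, b)
        done += 1
    return edges


# Usage: a small component, rewired fewer times than it has edges.
component = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0), (0, 3)]
random_component = rewire(component, n_rewirings=5, seed=1)
```

Every vertex keeps its degree under such a swap, so only characteristics other than the degree sequence vary between the generated networks.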




Fig. 5: Example of a rewired network. Dashed edges are rewired, i.e. replaced 
by solid edges. 



The details of the rewiring algorithm are omitted due to space
limitations; we only discuss two aspects. First, the number of
rewirings should be kept relatively small (e.g. $< |E_c|$), otherwise
the constructed networks are completely random with no trace
of the one we start with - (probably) not networks representing
a set of collisions. We also want to compare components to
other random ones, which are similar to them, at least with
respect to these rewirings. If a component significantly differs even
from these similar ones, there is probably some severe anomaly
in it.

Second, one can notice that the algorithm never changes the
degrees of the vertices. As we wish to assess the degrees as
well, the algorithm can be simply adapted to the task in an
ad hoc fashion. We add an extra vertex $v_v$ and connect all
other vertices with it. As this vertex is removed at the end,
rewiring one of the newly added edges with some other (old)
edge changes the degrees of the vertices. Let $\{v_i, v_v\}$, $\{v_k, v_l\}$ be
the edges being rewired and let $\{v_i, v_l\}$, $\{v_k, v_v\}$ be the edges
after the rewiring. The (true) degree of vertex $v_i$, $v_k$ was increased,
decreased by one respectively.

To assess the values of indicators we separately construct
random components for each component $c \in C(G)$ and indicator
$I_i \in I$, and approximate the distributions for characteristics
$H_i$ ($H_i$ are seen as random variables). A statistical test is
employed to test the null hypothesis that the observed value $H_i(c)$
comes from the distribution for $H_i$. The test can be one- or
two-tailed, based on the nature of characteristic $H_i$. In the case of
a one-tailed test, where large values of $H_i$ are suspicious, we get

$$ I_i(c) = \begin{cases} 1 & \hat{P}_c(H_i \geq H_i(c)) < t_i \\ 0 & \text{otherwise} \end{cases} \tag{11} $$

where probability density function $P_c(H_i)$ is approximated with
the generated distribution $\hat{P}_c(H_i)$ and $t_i$ is a critical threshold
or acceptable Type I error (e.g. set to 0.05). In the case of a
two-tailed test the equation (11) rewrites to

$$ I_i(c) = \begin{cases} 1 & \hat{P}_c(H_i \geq H_i(c)) < t_i/2 \;\vee\; \hat{P}_c(H_i \leq H_i(c)) < t_i/2 \\ 0 & \text{otherwise.} \end{cases} \tag{12} $$
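
Equations (11) and (12) amount to an empirical tail test over the values measured on the rewired networks. A minimal sketch follows, assuming the sampled values are already available as a list; the function name `indicator` and its parameters are hypothetical, and the tail probability is estimated simply as the fraction of samples at least (or at most) as extreme as the observed value.

```python
def indicator(observed, samples, t=0.05, two_tailed=False):
    """Return 1 if the observed value is suspicious, else 0.

    samples: values of the characteristic measured on the randomly
    rewired networks generated for one component.
    """
    n = len(samples)
    # Empirical estimate of P(H >= observed).
    upper = sum(1 for h in samples if h >= observed) / n
    if not two_tailed:
        return 1 if upper < t else 0
    # Empirical estimate of P(H <= observed).
    lower = sum(1 for h in samples if h <= observed) / n
    return 1 if (upper < t / 2 or lower < t / 2) else 0


# Usage: 200 sampled values, as in the evaluation described later.
sampled = [i % 10 for i in range(200)]
flag = indicator(observed=12, samples=sampled, t=0.05)
```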



Knowing the values for all indicators $I_i$ we can now indicate
the suspicious components in $C(G)$. The simplest way to
accomplish this is to use a majority classifier or voter, indicating
all the components, for which at least half of the indicators are set
to 1, as suspicious. Let $S(G)$ be a set of suspicious components
in a network $G$, $S(G) \subseteq C(G)$, then

$$ S(G) = \{c \mid c \in C(G) \wedge \sum_i I_i(c) \geq n/2\}. \tag{13} $$



When fraudulent components share most of the characteristics
measured by the indicators, we would clearly indicate them
(they would have most, at least half, of the indicators set). Still,
the approach is rather naive, having three major weaknesses
(among others): (1) there is no guarantee that the threshold $n/2$
is the best choice; (2) we do not consider how many components
have some particular indicator set; and (3) all indicators
are treated as equally important. Normally, we would use some
sort of supervised learning technique that eliminates these
weaknesses (e.g. regression, neural networks, classification trees,
etc.), but again, due to the lack of labeled data and skewed class
distribution in the collisions domain, this would only rarely be
feasible (the size of $C(G)$ is even much smaller than the size of
the actual data set).

To cope with the last two weaknesses mentioned, we suggest
using principal component analysis of RIDITs (PRIDIT)
proposed by Brockett & Levine (1977) (see Brockett (1981)),
which has already been used for detecting fraudulent insurance
claim files (Brockett et al., 2002), but not for detecting groups
of fraudsters (i.e. fraudulent components). The RIDIT analysis
was first introduced by Bross (1958).

RIDIT is basically a scoring method that transforms a set
of categorical attribute values into a set of values from interval
$[-1, 1]$, thus they reflect the probability of an occurrence
of some particular categorical value. Hence, an ordinal scale
attribute is transformed into an interval scale attribute. In our
case, all $I_i$ are simple binary attributes, and the RIDIT scores,
denoted $R_i$, are then just

$$ R_i(c) = \begin{cases} p_i^0 & I_i(c) = 1 \\ -p_i^1 & I_i(c) = 0 \end{cases} \tag{14} $$

where $c \in C(G)$, $p_i^1$ is the relative frequency of $I_i$ being equal to
1, computed from the entire data set, and $p_i^0 = 1 - p_i^1$.

We demonstrate the technique with an example. Let $p_i^1$ be
equal to 0.95 - almost all of the components have the indicator
$I_i$ set. The RIDIT score for some component $c$, with $I_i(c) = 1$,
is then just 0.05, as the indicator clearly gives a poor indication
of fraudulent components. On the other hand, for some component
$c$, with $I_i(c) = 0$, the RIDIT score is -0.95, since the indicator
very likely gives a good indication of the non-fraudulent
components. A similar intuitive explanation can be made by setting
$p_i^1$ to 0.05. A full discussion of RIDIT scoring is omitted;
for more details see (Brockett, 1981; Brockett & Levine, 1977).
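
For binary indicators the RIDIT scoring reduces to a couple of relative frequencies. The following sketch reproduces the example above; the function name `ridit_scores` and the row-per-component layout of the input are assumptions made for illustration.

```python
def ridit_scores(indicators):
    """Compute RIDIT scores for binary indicators.

    indicators: list of rows, one per component, each a list of 0/1
    indicator values. Returns a matrix of the same shape.
    """
    m = len(indicators)
    n = len(indicators[0])
    # Relative frequency of each indicator being set to 1.
    p1 = [sum(row[i] for row in indicators) / m for i in range(n)]
    return [
        [(1 - p1[i]) if row[i] == 1 else -p1[i] for i in range(n)]
        for row in indicators
    ]


# Usage: one indicator, set for 19 of 20 components (frequency 0.95).
data = [[1]] * 19 + [[0]]
scores = ridit_scores(data)
```

A component with the indicator set receives a score close to 0.05 and the single component without it a score of -0.95, matching the example in the text.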

Introduction of RIDIT scoring diminishes the previously mentioned
second weakness. To also cope with the third, we make
use of the PRIDIT technique. The intuition of this technique
is that we can weight indicators in some ensemble by assessing
the agreement of some particular indicator with the entire
ensemble. We make a (probably incorrect) assumption that
indicators are independent.

Formally, let $W$ be a vector of weights for the ensemble
of RIDIT scorers $R_i$ for indicators $I_i$, denoted
$W = [w_1, w_2 \dots w_n]^T$, and let $R$ be a matrix with the $ij$-th component
equal to $R_j(c_i)$, where $c_i$ is the $i$-th component in $C(G)$. Matrix
product $RW$ gives the ensemble's score for all the components,
i.e. the $i$-th component in vector $RW$ is equal to the weighted linear
combination of RIDIT scores for the $i$-th component in $C(G)$.
Denote $S = RW$; we can then assess the indicators' agreement with
the entire ensemble as (written in matrix form)

$$ R^T S / \|R^T S\|. \tag{15} $$

Equation (15) computes normalized scalar products of columns
of $R$, which correspond to the returned values of RIDIT scorers,
and $S$, which is the overall score of the entire ensemble (for
each component in $C(G)$). When the returned values of some
scorer are completely orthogonal to the ensemble's scores, the
resulting normalized scalar product equals 0, and reaches its
maximum, or minimum, when they are perfectly aligned.

Equation (15) thus gives scorers' (indicators') agreement with
the ensemble and can be used to assign new weights, i.e.
$W' = R^T S / \|R^T S\|$. Greater weights are assigned to the scorers that
largely agree with the general belief of the ensemble. Denote
$S' = RW'$, then $S'$ is a vector of overall scores using these
newly determined weights. There is of course no reason to stop
the process here, as we can iteratively get even better weights.
We can write

$$ W^{i} = \frac{R^T S^{i-1}}{\|R^T S^{i-1}\|} = \frac{R^T R W^{i-1}}{\|R^T R W^{i-1}\|} \tag{16} $$

for $i \geq 1$, which can be used to iteratively compute better and
better weights for an ensemble of RIDIT scorers $R_i$, starting
with some weights, e.g. $W^{0} = [1, 1 \dots 1]^T$ - the process
converges to some fixed point no matter the starting weights (due
to some assumptions). It can be shown that the fixed point is
actually the first principal component of the matrix $R^T R$, denoted
$W^{\infty}$. For more details on the PRIDIT technique see (Brockett et al.,
2002).
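
The iteration in equation (16) is a power iteration on the matrix of RIDIT scores. A minimal pure-Python sketch is given below; the names `pridit_weights` and `pridit_scores`, the fixed number of iterations and the list-of-rows matrix layout are assumptions, and a practical implementation would more likely rely on a linear algebra library.

```python
import math


def pridit_weights(R, iterations=100):
    """Iterate W <- R^T R W / ||R^T R W||, starting from all ones.

    R: list of rows (one per component) of RIDIT scores.
    """
    m, n = len(R), len(R[0])
    w = [1.0] * n
    for _ in range(iterations):
        # S = R W: ensemble score of every component.
        s = [sum(R[j][i] * w[i] for i in range(n)) for j in range(m)]
        # R^T S: agreement of every scorer with the ensemble.
        v = [sum(R[j][i] * s[j] for j in range(m)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in v))
        if norm == 0:
            return w  # degenerate case, keep the current weights
        w = [x / norm for x in v]
    return w


def pridit_scores(R, w):
    """Weighted linear combination of RIDIT scores per component."""
    return [sum(r * x for r, x in zip(row, w)) for row in R]


# Usage with a toy score matrix (two components, two indicators).
R = [[1.0, 0.0], [0.0, 0.5]]
weights = pridit_weights(R)
scores = pridit_scores(R, weights)
```

In this toy case the first indicator dominates, so the weights approach the unit vector along the first indicator, provided the starting vector is not orthogonal to it.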

We can now score each component in $C(G)$ using the PRIDIT
technique for indicators $I_i$ and output as suspicious all the
components with a score greater than 0. Thus

$$ S(G) = \{c \mid c \in C(G) \wedge R(c) W^{\infty} > 0\}, \tag{17} $$

where $R(c)$ is the row of matrix $R$ that corresponds to component
$c$ and $W^{\infty}$ is the vector of converged weights. Again there is no
guarantee that the threshold is the best choice. Still, if we know
the expected proportion of fraudulent components in the data set
(or e.g. expected number of fraudulent collisions), we can first
rank the components using the PRIDIT technique and then output
only the appropriate proportion of most highly ranked components.



4.3. Suspicious entities detection 

In the third module of the system key entities are detected
inside each previously identified suspicious component. We
focus on identifying key participants, which can later be used
for the identification of other key entities (collisions, vehicles,
etc.). Key participants are identified by employing the Iterative
Assessment Algorithm (IAA) that uses intrinsic and relational
attributes of the entities. The algorithm assigns a suspicion score
to each participant, which corresponds to the likelihood of it
being fraudulent.

In classical approaches over flat data, entities are assessed
using only their intrinsic attributes, thus they are assessed
in complete isolation from other entities. It has been empirically
shown that the assessment can be improved by also
considering the related entities, more precisely, by considering
the assessment of the related entities (Chakrabarti et al.,
1998; Domingos & Richardson, 2001; Lu & Getoor, 2003a,b;
Neville & Jensen, 2000). The assessment of an entity is inferred
from the assessments of the related entities and propagated
onward. Still, incorporating only the intrinsic attributes
of the related entities generally doesn't improve, or even
deteriorates, the assessment (Chakrabarti et al., 1998; Oh et al.,
2000).



The proposed IAA algorithm thus assesses the entities by also
considering the assessment of their related entities. As these
related entities were also assessed using the assessments of their
related entities, and so on, the entire network is used in the
assessment of some particular entity. This could not be achieved
otherwise, as the formulation would surely be too complex. We
proceed by introducing IAA in a general form.

Let $c$ be some suspicious component in network $G$, $c \in S(G)$,
and let $v_i$ be one of its vertices, $v_i \in V_c$. Furthermore, let $N(v_i)$
be a set of neighbor vertices of $v_i$ (i.e. vertices at distance 1)
and $V(v_i) = N(v_i) \cup \{v_i\}$, and let $E(v_i)$ be a set of edges incident
to $v_i$ (i.e. $E(v_i) = \{e \mid e \in E_c \wedge v_i \in e\}$). Let also $en_i$ be an
entity corresponding to vertex $v_i$ and $N(en_i)$, $V(en_i)$ be a set of
entities that corresponds to $N(v_i)$, $V(v_i)$ respectively. We define
the suspicion score $s$, $s(\cdot) \geq 0$, for the entity $en_i$ as

$$ s(en_i) = AM(s(N(en_i)), V(en_i), V(v_i), E(v_i)) = AM(i, c), \tag{18} $$



where $AM$ is some assessment model and $s(N(en_i))$ is a set of
suspicion scores for entities in $N(en_i)$. The suspicion of some
entity is dependent on the assessment of the related entities (first
argument in equation (18)), on the intrinsic attributes of related
entities and itself (second argument), and on the relational
attributes of the entity (last two arguments). We assume that $AM$
is linear in the assessments of the related entities (i.e. $s(N(en_i))$)
and that it returns higher values for fraudulent entities.

For some entity $en_i$, when the suspicion scores of the related
entities are known, $en_i$ can be assessed using equation (18).
Commonly, none of the suspicion scores are known in advance
(as the data set is unlabeled), and the equation thus cannot be
used in a common manner. Still, one can incrementally assess
the entities in an iterative fashion, similar to e.g. (Brin & Page,
1998; Kleinberg, 1999).



Let $s^0(\cdot)$ be some set of suspicion scores, e.g. $s^0(\cdot) = 1$. We
can then assess the entities using scores $s^0(\cdot)$ and equation (18),
and get better scores $s^1(\cdot)$. We proceed with this process,
iteratively refining the scores until some stopping criterion is reached.
Generally, on the $k$-th iteration, entities are assessed using

$$ s^k(en_i) = AM(s^{k-1}(N(en_i)), V(en_i), V(v_i), E(v_i)) = AM(i, k, c). \tag{19} $$

Note that the choice for $s^0(\cdot)$ is arbitrary - the process converges
to some fixed point no matter the starting scores (due to some
assumptions). Hence, the entities are assessed without knowing
any suspicion score in advance to bootstrap the procedure.
We present the IAA algorithm below.

IAA algorithm

    s^0(·) = 1
    k = 1
    WHILE NOT stopping criteria DO
        FOR ∀ v_i, en_i DO
            s^k(en_i) = α s^{k-1}(en_i) + (1 - α) AM(i, k, c)
        FOR ∀ v_i, en_i : v_i non-bucket DO
            normalize s^k(en_i)
        k = k + 1
    RETURN s^k(·)



Entities are iteratively assessed using model $AM$ ($\alpha$ is a
smoothing parameter set to e.g. 0.75). In order for the process
to converge, scores corresponding to non-bucket vertices
are normalized at the end of each iteration. Due to the fact
that relations represented by the networks are often not binary,
there are usually some vertices only serving as buckets that store
the suspicion assessed at this iteration to be propagated on the
next. Non-bucket vertices correspond to entities that are actually
being assessed and only these scores should be normalized
(for binary relations all the vertices are of this kind). The
structure of such bucket networks would typically correspond to a
bipartite graph^ - bucket vertices would only be connected to
non-bucket vertices (and vice versa). In the case of COPTA
networks, used in this module of the (prototype) system, bucket
vertices are those representing collisions.

One would intuitively run the algorithm until some fixed
point is reached, i.e. when the scores no longer change. We
empirically show that, despite the fact that iterative assessment
does indeed increase the performance, such an approach actually
decreases it. The reason is that the scores over-fit the model.
We also show that superior performance can be achieved with
a dynamic approach - by running the algorithm for $d(c)$
iterations (diameter of component $c$). For more see sections 5 and 6.

Note that if each subsequent iteration of the algorithm ac- 
tually increased the performance, one could assess the entities 
directly. When AM is linear in the assessments of related enti- 
ties, the model could be written as a set of linear equations and 
solved exactly (analytically). 

An arbitrary model can be used with the algorithm. We 
propose several linear models based on the observation that in 
many of these bucket networks the following holds: every en- 
tity is well defined with (only) the entities directly connected 
to it, considering the context observed. E.g. in the case of 
COPTA networks, every collision is connected to its partici- 
pants, who are clearly the ones who "define" the collision, and 
every participant is connected with its collisions, which are the 
precise aspect of the participant we wish to investigate when 
dealing with fraud detection. Similar discussion could be made 
for movie-actor, corporate board-director and other well known 
collaboration networks. A model using no attributes of the
entities is thus simply the sum of suspicion scores of the related
entities (we omit the arguments of the model)

$$ AM_{raw} = \sum_{\{v_i, v_j\} \in E(v_i)} s(en_j). \tag{20} $$

Our empirical evaluation shows that even such a simple model 
can achieve satisfactory performance. 
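
The iterative assessment with this simplest model can be sketched as follows. This is an illustrative reading of the pseudocode, not the authors' code: the function name `iaa`, the adjacency-dictionary input, the fixed iteration count as stopping criterion and normalizing by the sum of non-bucket scores are all assumptions (the text does not specify how scores are normalized).

```python
def iaa(adj, buckets, iterations, alpha=0.75):
    """Iteratively assess vertices using the sum-of-neighbours model.

    adj: dict mapping each vertex to a list of its neighbours.
    buckets: set of bucket vertices (e.g. collisions); their scores
    are propagated but never normalized.
    """
    s = {v: 1.0 for v in adj}  # arbitrary starting scores
    for _ in range(iterations):
        # Smoothed update: old score mixed with the model's assessment.
        new = {
            v: alpha * s[v] + (1 - alpha) * sum(s[u] for u in adj[v])
            for v in adj
        }
        # Normalize non-bucket scores so that they sum to 1 (assumed).
        total = sum(new[v] for v in adj if v not in buckets)
        if total > 0:
            for v in adj:
                if v not in buckets:
                    new[v] /= total
        s = new
    return s


# Usage: two collisions sharing one participant (p2).
network = {
    "c1": ["p1", "p2"],
    "c2": ["p2", "p3"],
    "p1": ["c1"],
    "p2": ["c1", "c2"],
    "p3": ["c2"],
}
scores = iaa(network, buckets={"c1", "c2"}, iterations=3)
```

In this toy network the participant involved in both collisions ends up with the highest suspicion score, which is the behaviour the model is meant to capture.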

To incorporate entities' attributes into the model, we introduce
factors. These are based on intrinsic or relational attributes
of entities. The intuition behind the first is that some intrinsic
attributes' values are highly correlated with fraudulent activity.
Suspicion scores of corresponding entities should in this case
be increased and also propagated on the related entities. Moreover,
many of the relational attributes (i.e. labels of the edges)
increase the likelihood of fraudulent activity - the propagation
of suspicion over such edges should also be increased.

Let $l_{E_c}$ be the edge labeling function and $\Sigma_{E_c}$ the alphabet of
all possible edge labels, i.e. $\Sigma_{E_c} = \{Driver, Passenger \dots\}$ (for
COPTA networks). Furthermore, let $En$ be a set of all entities
$en_i$. We define $F_{int}$, $F_{rel}$ to be the factors, corresponding to
intrinsic, relational attributes respectively, as

$$ F_{int}: En \to [0, \infty), \tag{21} $$

$$ F_{rel}: \Sigma_{E_c} \to [0, \infty). \tag{22} $$

^ In the social science literature bipartite graphs are known as collaboration
networks.

Improved model incorporating these factors is then

$$ AM_{bas} = F_{int}(en_i) \sum_{e = \{v_i, v_j\} \in E(v_i)} F_{rel}(l_{E_c}(e)) \, s(en_j). \tag{23} $$

Factors $F_{int}$ are computed from (similar for $F_{rel}$)

$$ F_{int}(en_i) = \prod_k F^k_{int}(en_i), \tag{24} $$

where

$$ F^k_{int}(en_i) = \begin{cases} 1/(1 - f^k_{int}(en_i)) & f^k_{int}(en_i) \geq 0 \\ 1 + f^k_{int}(en_i) & \text{otherwise} \end{cases} \tag{25} $$

and

$$ f^k_{int}: En \to (-1, 1). \tag{26} $$



$f^k_{int}$ are virtual factors defined by the domain expert. The
transformation with equation (25) is done only to define factors on
the interval $(-1, 1)$, rather than on $[0, \infty)$. The first is more
intuitive as e.g. two "opposite" factors are now $f$ and $-f$, $f \in [0, 1)$,
opposed to $f$ and $1/f$, $f > 0$, before.

Some virtual factor $f^k_{int}$ can be an arbitrary function defined
due to a single attribute of some entity, or due to several
attributes formulating correlations between the attributes. When
attributes' values correspond to some suspicious activity (e.g.
collision corresponds to some classical scheme), factors are set
to be close to 1, and close to -1, when values correspond to
non-suspicious activity (e.g. children in the vehicle). Otherwise,
they are set to be 0.
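
The transformation from virtual factors on $(-1, 1)$ to multiplicative factors can be illustrated with a short sketch. The function names are hypothetical, and combining several virtual factors by multiplication is an assumption made here, motivated by the remark that opposite factors $f$ and $-f$ should cancel out.

```python
def factor(f):
    """Map a virtual factor from (-1, 1) to a multiplicative factor."""
    if not -1 < f < 1:
        raise ValueError("virtual factor must lie in (-1, 1)")
    return 1 / (1 - f) if f >= 0 else 1 + f


def combined_factor(virtual_factors):
    """Combine several virtual factors (assumed: by multiplication)."""
    result = 1.0
    for f in virtual_factors:
        result *= factor(f)
    return result


# Usage: a suspicious attribute value and a non-suspicious one.
boost = factor(0.5)        # suspicion is doubled
damp = factor(-0.5)        # suspicion is halved
neutral = combined_factor([0.3, -0.3])
```

Two opposite virtual factors such as 0.3 and -0.3 combine to (approximately) 1, i.e. they leave the suspicion score unchanged.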

Note that assessment of some participant with models $AM_{raw}$
and $AM_{bas}$ is highly dependent on the number of collisions this
participant was involved in. More precisely, on the number of
the terms in the sums in equations (20), (23) (which is exactly
the degree of the corresponding vertex). Although this property
is not without merit, we still implicitly assume we possess all of
the collisions a certain participant was involved in. This
assumption is often not true (in practice).

We propose a third model diminishing the mentioned assumption.
Let $\bar{d}_c$ be the average degree of the vertices in network
$G$, $\bar{d}_c = ave\{d(v_i) \mid v_i \in V_c\}$. The model is

$$ AM^{mean}_{*} = \frac{\bar{d}_c + d(v_i)}{2} \, \frac{AM_{*}}{d(v_i)} = \frac{1}{2} \left( 1 + \frac{\bar{d}_c}{d(v_i)} \right) AM_{*}, \tag{27} $$

where $AM_{*}$ can be any of the models $AM_{raw}$, $AM_{bas}$. $AM^{mean}_{*}$
averages terms in the sum of the model $AM_{*}$, and multiplies this
average by the mean of vertex's degree and the average degree
over all the vertices in $V_c$. Thus a sort of Laplace smoothing
is employed that pulls the vertex degree toward the average, in
order to diminish the importance of this parameter in the final
assessment. Empirical analysis in section 5 shows that such a
model outperforms the other two.
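
The smoothing can be written as a one-line correction applied to the value of any underlying model. The sketch below is an illustration only; the function name `am_mean` and its parameters are hypothetical.

```python
def am_mean(am_value, degree, avg_degree):
    """Pull the effective degree of a vertex toward the average degree.

    am_value: value of the underlying model (a sum over the incident
    edges of the vertex); degree: degree of the vertex;
    avg_degree: average vertex degree.
    """
    if degree <= 0:
        raise ValueError("vertex must have at least one incident edge")
    # Average the terms of the sum, then multiply by the mean of the
    # vertex degree and the average degree.
    return (avg_degree + degree) / 2 * am_value / degree


# Usage: a participant with 3 collisions where the average degree is 1.
smoothed = am_mean(am_value=6.0, degree=3, avg_degree=1.0)
```

A vertex whose degree equals the average is left unchanged, a vertex with above-average degree is scored lower than by the underlying model, and one with below-average degree is scored higher.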

Knowing scores $s(\cdot)$ for all the entities in some connected
component $c \in C(G)$, one can rank them according to the suspicion
of their being fraudulent. In order to also compare the entities
from various components, scores must be normalized appropriately
(e.g. multiplied with the number of collisions represented
by component $c$).



4.4. Final remarks 

In the previous section (third module of the system) we focused
only on detection of fraudulent participants. Their suspicion
scores can now be used for assessment of other entities
(e.g. collisions, vehicles), using one of the assessment models
proposed in section 4.3.

When all of the most highly ranked participants in some 
suspicious component are directly connected to each other (or 
through buckets), they are proclaimed to belong to the same 
group of fraudsters. Otherwise they belong to several groups. 
During the investigation process, the domain expert or investi- 
gator analyzes these groups and determines further actions for 
resolving potential fraud. Entities are investigated in the order 
induced by scores s(-). 

Networks also allow a neat visualization of the assessment
(see Fig. 6).

Fig. 6: Four COPTA networks showing the same group of collisions. The size of the
participants' vertices corresponds to their suspicion score; only participants with
a score above some threshold, and connecting collisions, are shown on each network.
The contour was drawn based on the harmonic mean distance to every
vertex, weighted by the suspicion scores. (Blue) filled collisions' vertices in the
first network correspond to collisions that happened at night.



5. Evaluation with the prototype system

We implemented a prototype system to empirically evaluate
the performance of the proposed approach. Furthermore, various
components of the system are analyzed and compared to other
approaches. To simulate realistic conditions, the data set used for
evaluation consisted only of the data that can be easily
(automatically) retrieved from police records (semistructured data).
We report results of the assessment of participants (not e.g.
collisions).

5.1. Data

The data set consists of 3451 participants involved in 1561 
collisions in Slovenia between the years 1999 and 2008. The 
set was made by merging two data sets, one labeled and one 
unlabeled. 

The first, labeled, consists of collisions corresponding to pre- 
viously identified fraudsters and some other participants, which 
were investigated in the past. In a few cases, when class of a 
participant could not be determined, it was set according to the 
domain expert's and investigator's belief. As the purpose of our 
system is to identify groups of fraudsters, and not some iso- 
lated fraudulent collisions, (almost) all isolated collisions were 
removed from this set. It is thus a bit smaller (i.e. 211 partici- 
pants, 91 collisions), but still large enough to make the assess- 
ment. 

To achieve a more realistic class distribution and better statistics
for PRIDIT analysis, the second larger data set was merged
with the first. The set consists of various collisions chosen
(almost) at random, although some of them are still related to
others. Since random data sampling is not advised for relational
data (Jensen, 1999), this set is used explicitly for PRIDIT analysis.
Both data sets consist of only standard collisions (e.g. there
are no chain collisions involving numerous vehicles or coaches
with many passengers).

Class distribution for the data set can be seen in table 2.

Class distribution

                 Count     Proportion
Fraudster           46      1.3%   21.8%
Non-fraudster      165      4.8%   78.2%
Unlabeled         3240     93.9%

Table 2: Class distribution for the data set used in the analysis of the proposed
expert system.



The entire assessment was made using the merged data set, 
while the reported results naturally only correspond to the first 
(labeled) set. Note that the assessment of entities in some con- 
nected component is completely independent of the entities in 
other components (except for PRIDIT analysis). 

5.2. Results 

Performance of the system depends on random generation of
networks, used for detection of suspicious components (second
module). We construct 200 random networks for each indicator
and each component (equations (11), (12)); however, the results
still vary a little. The entire assessment was thus repeated
20 times and the scores were averaged. To assess the ranking
of the system, average AUC (Area Under Curve) scores were
computed, denoted $\overline{AUC}$. Results given in tables 5, 6, 7 and 8
are all $\overline{AUC}$.

In order to obtain a standard for other analyses, we first
report the performance of the system that uses PRIDIT analysis
for fraudulent components detection, and IAA algorithm
with model $AM^{mean}_{bas}$ for fraudulent entities detection, denoted
$IAA^{mean}_{bas}$ (see table 3). Various metrics are computed, i.e.
classification accuracy (CA), recall (true positive rate), precision
(positive predictive value), specificity (1 - false positive rate),
F1 score (harmonic mean of recall and precision) and AUC.
All but the last are metrics that assess the classification of some
approach, thus a threshold for suspicion scores must be defined.
We report the results from the first run that minimize the total
cost, assuming the cost of misclassified fraudsters and
non-fraudsters is the same. The same holds for the confusion matrix
seen in table 4.

Golden standard

CA             0.8720
Recall         0.8913
Precision      0.6508
Specificity    0.8667
F1 score       0.7523
AUC            0.9228

Table 3: Performance of the system that uses PRIDIT analysis with $IAA^{mean}_{bas}$
algorithm. Various metrics are reported; all except AUC are computed so the
total cost (on the first run) is minimal.

Confusion matrix

                 Suspicious   Unsuspicious
Fraudster            41             5
Non-fraudster        22           143

Table 4: Confusion matrix for the system that uses PRIDIT analysis with
$IAA^{mean}_{bas}$ algorithm (determined so the total cost on the first run is minimal).
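
As a cross-check, the classification metrics in table 3 follow directly from the confusion matrix in table 4. The short sketch below recomputes them; the function name `classification_metrics` is hypothetical and only the standard definitions of the metrics are used.

```python
def classification_metrics(tp, fn, fp, tn):
    """Compute standard metrics from a binary confusion matrix."""
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    return {
        "CA": (tp + tn) / (tp + fn + fp + tn),
        "Recall": recall,
        "Precision": precision,
        "Specificity": tn / (tn + fp),
        "F1": 2 * precision * recall / (precision + recall),
    }


# Fraudsters: 41 suspicious, 5 unsuspicious;
# non-fraudsters: 22 suspicious, 143 unsuspicious.
metrics = classification_metrics(tp=41, fn=5, fp=22, tn=143)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Rounded to four decimal places, the values agree with those reported in table 3 (AUC cannot be derived from a single confusion matrix and is therefore not included).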



We proceed with an in-depth analysis of the proposed IAA
algorithm. Table 5 shows the results of the comparison of
different assessment models, i.e. $IAA_{raw}$, $IAA_{bas}$, $IAA^{mean}_{raw}$
and $IAA^{mean}_{bas}$. Factors for models $IAA_{bas}$ and $IAA^{mean}_{bas}$
(equation (25)) were set by the domain expert, with the help of
statistical analysis of data from collisions. To further analyze the
impact of factors on final assessment, an additional set of factors
was defined by the authors. Values were set due to authors'
intuition; corresponding models are $IAA_{int}$ and $IAA^{mean}_{int}$.
Results of the analysis can be seen in table 6.

Assessment models

           $IAA_{raw}$   $IAA_{bas}$   $IAA^{mean}_{raw}$   $IAA^{mean}_{bas}$
PRIDIT     0.8872        0.9145        0.8942               0.9228

Table 5: Comparison of different assessment models for IAA algorithm (after
PRIDIT analysis).

As already mentioned, the performance of the IAA algorithm
depends on the number of iterations made in the assessment
(see section 4.3). We have thus plotted the AUC scores with
respect to the number of iterations made (for the first run), in
order to clearly see the dependence; plots for $IAA^{mean}_{raw}$, $IAA^{mean}_{bas}$
can be seen in Fig. 7, Fig. 8 respectively. We also show that
superior performance can be achieved if the number of iterations
is set dynamically. More precisely, the number of iterations
made for some component $c \in C(G)$ is

$$ \max\{\bar{d}_G, d(c)\}, \tag{28} $$

where $d(c)$ is the diameter of $c$ and $\bar{d}_G$ the average diameter over
all the components. All other results reported in this analysis
used such a dynamic setting.

Factors

           $IAA^{mean}_{raw}$   $IAA^{mean}_{int}$   $IAA^{mean}_{bas}$
ALL        0.8188               0.8435               0.8787
PRIDIT     0.8942               0.9086               0.9228

Table 6: Analysis of the impact of factors on the final assessment (on all the
components and after PRIDIT analysis).




Fig. 7: AUC scores with respect to the number of iterations made in the IAA
algorithm. Solid curves correspond to $IAA^{mean}_{raw}$ algorithm after PRIDIT analysis
and dashed curves to $IAA^{mean}_{raw}$ algorithm run on all the components. Straight
line segments show the scores achieved with dynamic setting of the number of
iterations (see text).

Fig. 8: AUC scores with respect to the number of iterations made in the IAA
algorithm. Solid curves correspond to $IAA^{mean}_{bas}$ algorithm after PRIDIT analysis
and dashed curves to $IAA^{mean}_{bas}$ algorithm run on all the components. Straight
line segments show the scores achieved with dynamic setting of the number of
iterations (see text).

Due to the lack of other expert system approaches for detecting
groups of fraudsters, or even individual fraudsters (according
to our knowledge), no comparative analysis of such kind
could be made. The proposed IAA algorithm is thus compared
against several well known measures for anomaly detection in
networks - betweenness centrality (BetCen), closeness centrality
(CloCen), degree centrality (DegCen) and eigenvector centrality
(EigCen) (Freeman, 1977, 1979). They are defined as

$$ BetCen(v_i) = \sum_{v_j, v_k \in V_c} \frac{g_{v_j,v_k}(v_i)}{g_{v_j,v_k}}, \tag{29} $$

$$ CloCen(v_i) = \frac{1}{\sum_{v_j \in V_c} d(v_i, v_j)}, \tag{30} $$

$$ DegCen(v_i) = \frac{d(v_i)}{n_c - 1}, \tag{31} $$

$$ EigCen(v_i) = \frac{1}{\lambda} \sum_{\{v_i, v_j\} \in E_c} EigCen(v_j), \tag{32} $$

where $n_c$ is the number of vertices in component $c$, $n_c = |V_c|$,
$\lambda$ is a constant, $g_{v_j,v_k}$ is the number of geodesics between vertices
$v_j$ and $v_k$ and $g_{v_j,v_k}(v_i)$ the number of such geodesics that
pass through vertex $v_i$, $i \neq j \neq k$. For further discussion see
(Freeman, 1977, 1979; Newman, 200?).

These measures of centrality were used to assign a suspicion
score to each participant; scores were also appropriately normalized
as in the case of the IAA algorithm. For a fair comparison,
measures were compared against the model that uses no intrinsic
attributes of entities, i.e. $IAA^{mean}_{raw}$. The results of the analysis
are shown in table 7.

IAA algorithm

           BetCen   CloCen   DegCen   EigCen   $IAA^{mean}_{raw}$
ALL        0.6401   0.8138   0.7428   0.7300   0.8188
PRIDIT     0.6541   0.8158   0.8597   0.8581   0.8942

Table 7: Comparison of the IAA algorithm against several well known measures
for anomaly detection in networks (on all the components and after PRIDIT
analysis). For a fair comparison, no intrinsic attributes are used in the IAA
algorithm.


Next, we analyzed different approaches for detection of
fraudulent components (see table 8). The same set of 9 indicators
was used for the majority voter (equation (13)) and for (P)RIDIT
analysis (equation (11)). For the latter, we use a variant of
random undersampling (RUS) to cope with the skewed class
distribution. We output the most highly ranked components, such
that the set of selected components contains 4% of all the
collisions (in the merged data set). Analyses of automobile
insurance fraud mainly agree that up to 20% of all the collisions
are fraudulent, and that up to 20% of the latter correspond to
non-opportunistic fraud (various sources), i.e. 0.2 × 0.2 = 4% of
all the collisions. However, for the majority voter, such an
approach actually decreases performance - we therefore report
results where all components with at least half of the indicators
set are selected.
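
The selection rule for the majority voter can be illustrated with a minimal sketch. The component names and indicator values below are hypothetical, and the functions `majority_vote` and `select_suspicious` are illustrative names only, not part of the prototype system.

```python
from typing import Dict, List


def majority_vote(indicators: List[bool]) -> bool:
    """Flag a component when at least half of its indicators are set."""
    return 2 * sum(indicators) >= len(indicators)


def select_suspicious(components: Dict[str, List[bool]]) -> List[str]:
    """Return the (sorted) names of components flagged by the majority voter."""
    return sorted(name for name, flags in components.items() if majority_vote(flags))


# Hypothetical components, each described by 9 binary indicators.
components = {
    "c1": [True] * 5 + [False] * 4,  # 5 of 9 indicators set -> selected
    "c2": [True] * 1 + [False] * 8,  # 1 of 9 indicators set -> not selected
    "c3": [True] * 4 + [False] * 5,  # 4 of 9 indicators set -> not selected
}
suspicious = select_suspicious(components)
```

Because the rule applies a fixed threshold instead of ranking, the size of the returned set is not controlled, unlike the ranked (P)RIDIT output.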

Several individual indicators, achieving superior performance,
are also reported. Indicator I_BetCen is based on betweenness
centrality (equation (30)), I_MinCov on minimum vertex cover and
I_l^-1 on the l^-1 measure, defined as the harmonic mean distance
between every pair of vertices in some component c,

$$ l^{-1} = \frac{1}{|P_c|} \sum_{(v_i, v_j) \in P_c} d(v_i, v_j)^{-1}, \qquad (33) $$

where P_c denotes the set of pairs of vertices in c.
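
A minimal sketch of the harmonic mean distance measure follows. It assumes an unweighted, undirected component given as an adjacency list, and it assumes that the average is taken over all unordered pairs of distinct vertices; the exact normalization used in the paper may differ. The function names are hypothetical.

```python
from collections import deque
from itertools import combinations
from typing import Dict, List

Graph = Dict[str, List[str]]


def bfs_distances(graph: Graph, source: str) -> Dict[str, int]:
    """Shortest-path lengths (in edges) from source to every reachable vertex."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def inverse_harmonic_mean_distance(graph: Graph) -> float:
    """Mean of 1 / d(u, v) over unordered vertex pairs (assumed normalization).

    Unreachable pairs contribute 0, i.e. their distance is treated as infinite.
    """
    vertices = list(graph)
    pairs = list(combinations(vertices, 2))
    if not pairs:
        return 0.0
    all_dist = {u: bfs_distances(graph, u) for u in vertices}
    total = sum(1.0 / all_dist[u][v] for u, v in pairs if v in all_dist[u])
    return total / len(pairs)


# Two hypothetical components with three vertices each.
path = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
triangle = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b"]}
```

Densely connected components, such as the triangle, obtain values close to 1, whereas sparse or elongated components obtain lower values.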



Fraudulent components

                I_MinCov   I_l^-1   I_BetCen   MAJOR    RIDIT    PRIDIT
ALL             0.6019     0.6386   0.6774     0.7946   0.6843   0.7114
IAA^mean_bas    0.6119     0.8494   0.8549     0.8507   0.9221   0.9228

Table 8: Comparison of different approaches for detection of fraudulent
components (prior to no fraudulent entities detection and IAA^mean_bas).

We last analyze the importance of proper data representation
for detection of groups of fraudsters - the use of networks.
Networks were thus transformed into flat data and some standard
unsupervised learning techniques were examined (e.g. k-means,
hierarchical clustering). We obtained no results comparable to
those given in table 3.

Furthermore, we tested nine standard supervised data-mining
techniques to analyze whether data labels can compensate for the
inappropriate representation of data. We used (default)
implementations of classifiers in the Orange data-mining software
(Demšar et al., 2004), and 20-fold cross validation was employed
as the validation technique. The best performance, up to
AUC ≈ 0.86, was achieved with Naive Bayes, support vector
machines, random forest and, interestingly, also the k-nearest
neighbors classifier. Scores for other approaches were below
AUC ≈ 0.80 (e.g. logistic regression, classification trees, etc.).

6. Discussion 

Empirical evaluation from the previous section shows that
automobile insurance fraud can be detected with the proposed
system. Moreover, the results suggest that appropriate data
representation is vital - even a simple approach over networks
can detect a great deal of fraud. This section discusses the
results in greater detail (in the order given).

Almost all of the metrics obtained with PRIDIT analysis and the
IAA^mean_bas algorithm, the gold standard, are very high
(table 3). Only precision appears low, but this results (only)
from the skewed class distribution in the domain. The F1 measure
is consequently also somewhat lower; otherwise the performance of
the system is more than satisfactory. The latter was confirmed by
the experts and investigators from a Slovenian insurance company,
who were also pleased with the visual representation of the
results.

The confusion matrix given in table 4 shows that we correctly
classified almost 90% of all fraudsters and over 85% of
non-fraudsters. Only 5 fraudsters were not detected by the
prototype system. We thus obtained a particularly high recall,
which is essential for all fraud detection systems. The majority
of unlabeled participants were classified as unsuspicious (not
shown in table 4), but the corresponding collisions are mainly
isolated and the participants could have been trivially
eliminated anyway (for our purposes).
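
The effect of a skewed class distribution on precision can be illustrated with a small worked example. The counts below are hypothetical and are chosen only to keep the per-class rates fixed (90% of fraudsters and 85% of non-fraudsters classified correctly); they are not the counts from the confusion matrix.

```python
from typing import Dict


def metrics(tp: int, fn: int, fp: int, tn: int) -> Dict[str, float]:
    """Recall, specificity, precision and F1 from confusion-matrix counts."""
    recall = tp / (tp + fn)
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "recall": recall,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
    }


# Hypothetical skewed data: 100 fraudsters among 2100 participants.
skewed = metrics(tp=90, fn=10, fp=300, tn=1700)

# Hypothetical balanced data with the same per-class rates.
balanced = metrics(tp=90, fn=10, fp=15, tn=85)
```

With identical recall and specificity, precision drops from about 0.86 in the balanced case to about 0.23 in the skewed case, and the F1 measure drops with it.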

We proceed with the discussion of different assessment models
(table 5). The performance of the simplest model, IAA_raw, which
uses no domain expert's knowledge, could already prove sufficient
in many circumstances. It can still be significantly improved by
also considering the factors set by the domain expert (model
IAA_bas). Model IAA^mean further improves the assessment of both
(simple) models, confirming the hypothesis behind it (see
section 4.3). Although the models (probably incorrectly) assume
that the fraudulence of an entity is linear (in the fraudulences
of the related entities), they give a good approximation of the
fraudulent behavior.
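
The linearity assumption can be made concrete with a generic sketch of iterative neighbor averaging. This is not the IAA model itself: it uses no factors, and the update rule (the plain mean of the neighbors' previous scores) is an assumption made purely for illustration. The graph, the initial scores and the function name are hypothetical.

```python
from typing import Dict, List

Graph = Dict[str, List[str]]


def iterate_assessment(
    graph: Graph, initial: Dict[str, float], iterations: int
) -> Dict[str, float]:
    """Repeatedly replace each score by the mean score of the neighbours.

    All scores of one iteration are computed from the scores of the previous
    iteration, so the update is linear in the neighbours' scores.
    """
    scores = dict(initial)
    for _ in range(iterations):
        updated = {}
        for v, neighbours in graph.items():
            if neighbours:
                updated[v] = sum(scores[u] for u in neighbours) / len(neighbours)
            else:
                updated[v] = scores[v]  # isolated entity keeps its score
        scores = updated
    return scores


# Hypothetical network of four entities with initial suspicion scores.
graph = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"], "d": ["c"]}
initial = {"a": 1.0, "b": 1.0, "c": 0.0, "d": 0.0}
after_one = iterate_assessment(graph, initial, iterations=1)
```

After a single iteration, entity c, which is related to two suspicious entities, already receives a higher score than the weakly connected entity d.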

The analysis of the factors used in the models confirms their
importance for the final assessment. As expected, model MA™™"
outperforms MA™™", and the latter outperforms /AA™*™ (table 6).
First, this confirms the hypothesis that domain knowledge can be
incorporated into the model using factors (as defined in
section 4.3). Second, it shows that better understanding of the
domain can improve the assignment of factors. The combination of
both makes the system extremely flexible, allowing for detection
of new types of fraud immediately after they have been noticed by
the domain expert or investigator.

As already mentioned, running the IAA algorithm for too long
over-fits the model and decreases the algorithm's final
performance (see Fig. 7 and Fig. 8; note the different scales
used). Early iterations of the algorithm still increase the
performance in all cases analyzed, which demonstrates the
importance of iterative assessment as opposed to a single-pass
approach. However, after some particular number of iterations has
been reached, performance decreases (at least slightly). Also
note that the decrease is much larger in the case of the
IAA^mean_raw model than IAA^mean_bas, indicating that the latter
is superior to the former. We propose to use this decrease in
performance as an additional evaluation of any model used with
the IAA, or a similar, algorithm.

It is preferable to run the algorithm for only a few iterations
for one more reason. Networks are often extremely large,
especially when they describe many characteristics of entities.
In this case, running the algorithm until some fixed point is
reached is simply not feasible. Since the prototype system uses
only the basic attributes of the entities, the latter does not
present a problem.

The number of iterations that achieves the best performance
clearly depends on various factors (data set, model, etc.). Our
evaluation shows that superior, or at least very good,
performance (Fig. 7, Fig. 8) can be achieved with the use of a
dynamic setting of the number of iterations (equation (28)).

When no detection of fraudulent components is done, the
comparison between the IAA algorithm and measures of centrality
shows no significant difference (table 7). On the other hand,
when we use PRIDIT analysis for fraudulent components detection,
the IAA algorithm dominates the others. Still, the results
obtained with DegCen and EigCen are comparable to those obtained
with supervised approaches over flat data. This shows that even a
simple approach can detect a reasonably large portion of fraud,
if an appropriate representation of data is used (networks).
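
How a simple centrality measure can be scored as a fraud detector is sketched below. The sketch assumes the usual normalization of degree centrality by n - 1 and computes AUC as the probability that a randomly chosen fraudulent entity is ranked above a randomly chosen non-fraudulent one (ties counted as one half). The network and the labels are hypothetical.

```python
from typing import Dict, List

Graph = Dict[str, List[str]]


def degree_centrality(graph: Graph) -> Dict[str, float]:
    """Degree of each vertex divided by the maximum possible degree, n - 1."""
    n = len(graph)
    if n < 2:
        return {v: 0.0 for v in graph}
    return {v: len(neighbours) / (n - 1) for v, neighbours in graph.items()}


def auc(scores: Dict[str, float], labels: Dict[str, bool]) -> float:
    """Area under the ROC curve via pairwise comparison of scores."""
    positives = [s for v, s in scores.items() if labels[v]]
    negatives = [s for v, s in scores.items() if not labels[v]]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical component: "h" and "x" are labeled as fraudulent.
graph = {"h": ["x", "y", "z"], "x": ["h", "y"], "y": ["h", "x"], "z": ["h"]}
labels = {"h": True, "x": True, "y": False, "z": False}
score = auc(degree_centrality(graph), labels)
```

The pairwise formulation is quadratic in the number of entities, which is acceptable for a sketch but would be replaced by a rank-based computation on large networks.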

The analysis of different approaches for detection of fraudulent
components produces no major surprises (table 8) - the best
results are obtained using (P)RIDIT analysis. Note that a single
indicator can match the performance of the majority classifier
MAJOR, confirming its naivety (see section 4.3); the
exceptionally high AUC score obtained by MAJOR, prior to no
fraudulent entities detection, only results from the fact that
the returned set of suspicious components is almost 10 times
smaller than for other approaches. The precision of the approach
is thus much higher, but at the price of lower recall (useless
for fraud detection).
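
For contrast with the majority voter, a RIDIT-style scoring of indicators can be sketched as follows. The formula used here (the proportion of observations in lower categories minus the proportion in higher categories) is the commonly cited RIDIT form and is an assumption; the paper's own definition is given elsewhere and may differ in sign or category ordering. The unweighted sum in `component_score` is also an assumption, since PRIDIT additionally weights the indicators.

```python
from typing import List


def category_proportions(values: List[int], n_categories: int) -> List[float]:
    """Observed share of each ordered category 0 .. n_categories - 1."""
    counts = [0] * n_categories
    for value in values:
        counts[value] += 1
    return [count / len(values) for count in counts]


def ridit_scores(proportions: List[float]) -> List[float]:
    """Score of category i: mass below i minus mass above i (assumed form)."""
    return [
        sum(proportions[:i]) - sum(proportions[i + 1:])
        for i in range(len(proportions))
    ]


def component_score(values: List[int], scores: List[List[float]]) -> float:
    """Unweighted sum of the per-indicator scores of one component."""
    return sum(scores[t][value] for t, value in enumerate(values))


# Hypothetical binary indicator observed on four components: rarely set.
proportions = category_proportions([0, 0, 0, 1], n_categories=2)
scores = ridit_scores(proportions)

# A component with the first of two such indicators set and the second not set.
total = component_score([1, 0], [scores, scores])
```

A rarely set indicator thus contributes a large positive score when it is set and only a small negative score when it is not, which a fixed majority threshold cannot express.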

We have already discussed the purpose of hierarchical detection
of groups of fraudsters - to simplify detection of fraudulent
entities with appropriate detection of fraudulent components.
Another implication of such an approach is a simpler, or in some
cases only then feasible, data collection process. As the
detection of components is done using only the relations between
entities (relational attributes), a large portion of data can be
discarded without knowing the values of any of the intrinsic
attributes. This characteristic of the system is vital when
deploying it in practice - (complete) data often cannot be
obtained for all the participants, due to the sensitivity of the
domain.

Last, we briefly discuss the applicability of the proposed
system in other domains. The presented IAA algorithm can be used
for arbitrary assessment of entities over some relational domain,
exploring the relations between entities with no demand for an
(initial) labeled data set. When every entity is well defined by
(only) the entities directly related to it, considering the
context observed, one of the proposed assessment models can also
be used. Furthermore, the presented framework (four modules of
the system) could be employed for fraud detection in other
domains, and more generally in any domain where we are interested
in groups of related entities sharing some particular
characteristics. The framework exploits the relations between
entities in order to improve the assessment, and is structured
hierarchically to make it applicable in practice.

7. Conclusion 

The article proposes a novel expert system approach for
detection of groups of automobile insurance fraudsters with
networks. Empirical evaluation shows that such fraud can be
efficiently detected with the proposed system and, in particular,
that proper representation of data is vital. For the system to be
applicable in practice, no labeled data set is used. The system
instead allows the incorporation of the domain expert's
knowledge, and it can thus be adapted to new types of fraud as
soon as they are noticed. The approach can help the domain
investigator to detect and investigate fraud much faster and more
efficiently. Moreover, the employed framework is easy to
implement and is also applicable for detection (of fraud) in
other relational domains.

Future research will focus on further analyses of different
assessment models for the IAA algorithm, including nonlinear
models. Moreover, the IAA will be altered into an unsupervised
algorithm, learning the factors of the model in an unsupervised
manner during the actual assessment. The factors would thus not
have to be specified by the domain expert. Applications of the
system in other domains will also be investigated.